Toggling a Genetic Switch Using Reinforcement 



Learning 

Aivar Sootla, Natalja Strelkowa, Damien Ernst, Mauricio Barahona, Guy-Bart Stan 



Abstract — In this paper, we consider the problem of optimal 
exogenous control of gene regulatory networks. Our approach 
consists in adapting and further developing an established rein- 
forcement learning algorithm called the fitted Q iteration. This 
algorithm infers the control law directly from the measurements 
of the system's response to external control inputs without the 
use of a mathematical model of the system. The measurement 
data set can either be collected from wet-lab experiments or 
artificially created by computer simulations of dynamical models 
of the system. The algorithm is applicable to a wide range of 
biological systems due to its ability to deal with nonlinear and 
stochastic system dynamics. To illustrate the application of the 
algorithm to a gene regulatory network, the regulation of the 
toggle switch system is considered. The control objective of this 
problem is to drive the concentrations of two specific proteins to 
a target region in the state space. In our companion paper, we 
take a closer look at the reference tracking problem from the 
reinforcement learning point of view and consider the generalised 
repressilator as an example. 

Index Terms — Reinforcement learning, synthetic biology, fitted 
Q iteration, regulation, gene regulatory networks 
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Fig. 1. A schematic depiction of the exogenously controlled genetic toggle 
switch. The green circle represents the Lad gene and the red circle represents 
the TetR gene. The arrows with flat ends represent repression of one gene by 
another. In the steady-state only one of the genes can be upregulated (or 
switched on). The goal is to toggle one of the genes, i.e., drive this gene from 
its downregulated mode to its upregulated one. 



I. Introduction 

Synthetic biology aims at the (re-)design of biological func- 
tions in living organisms for their use in various applications 
such as bioengineering, bioremediation and energy [l'|. This 
is typically realised via the insertion of foreign genes inside 
a host cell (e.g., a bacterium E. coli). The expression of the 
foreign genes inside the host cells imposes de facto a burden 
on the native processes of the host cells. A high burden induces 
severe intracellular perturbations and can decrease cellular 
growth rate. This in turn disrupts the intended behaviour of 
synthetic biology gene networks [2j. Hence, it is highly desir- 
able to develop means for controlling gene networks so as to 
efficiently enable the designed behaviour while simultaneously 
minimising the burden induced by this behaviour on the host 
cells. 

The current biotechnology state-of-the-art allows us to 
quantitatively measure and interact with gene regulatory net- 
works. Quantitative in vivo estimates of gene networks' states 
(outputs) can be obtained via fluorescent markers Q, flU 
(e.g., green fluorescent protein, GFP or red fluorescent protein, 
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mCherry). A typical input is a targeted induction of the 
gene expression, which can be achieved by, e.g., conditional 
gene knock outs 0, j6'|, heat shocks |7| or monochromatic 
light pulses |0D, (9). This means that feedback control is 
technologically feasible in vivo. The objective of the control 
method can be minimal time control (i.e., driving the system as 
fast as possible to a target region in the state-space), minimal 
burden control (minimal expression of heterologous proteins), 
or a trade-off between the two, as considered in this paper. The 
control method must reach the objective, while maintaining the 
designed functions of a synthetic gene regulatory network. 

Some control problems in gene regulatory networks were 
successfully addressed 1101 . ifTTI . |[T2l . In all those papers, the 
authors used classical control methods, which infer the control 
law (or the control policy) based on a mathematical model of 
the system. One of the bottlenecks of these approaches is the 
modelling part, which for large gene regulatory networks is an 
extremely hard and lengthy process. Moreover, there are other 
challenges such as stochasticity. Stochasticity is expressed in 
the form of the intrinsic and extrinsic noise during gene ex- 
pression [13]. Transcription and translation processes typically 
involve a few randomly interacting molecules, thus adding 
thermodynamic stochasticity to biochemical interactions. 

The problems with modelling and stochasticity point to- 
wards the use of reinforcement learning methods [14], [15|, 
which infer the control policy based solely on interactions 
with the real system. These methods do not require a physical 
model. Moreover, very few assumptions on the structure of the 
controlled system are made. However, the major advantage of 



2 



the reinforcement learning methods is to some extent their 
drawback. Indeed, these methods require interactions with the 
real system, which implies numerous costly and lengthy wet- 
lab experiments. A solution would be a reinforcement learning 
method, which learns the policy using a single experiment. For 
systems relevant to this paper, however, such a method will not 
be efficient. Indeed, a control policy, which tries to learn and 
control such systems in a single experiment, is generally not 
better than a random control policy |[T6l . In order to address 
these concerns, a hybrid approach is proposed. First, an initial 
control policy is computed using past experimental data and/or 
a mathematical model. After that the control policy is updated 
during the experiment using reinforcement learning methods. 
This approach will be applied to the regulation of the toggle 
switch system schematically depicted in Figure [Tj The control 
objective of this problem is to drive the concentrations of two 
specific proteins to a target set in the state space and remain 
in this set. 

The initial policy is obtained by the Fitted Q Iteration 
algorithm [17]. The algorithm requires only one-step system 
transitions to infer the control policy. A one-step system tran- 
sition is a triplet {n, u,n + }, where n + denotes a successor 
state of the system in state n subjected to input u. Fitted 
Q Iteration can also handle nonlinear and stochastic systems 
and it is sample efficient. One-step transitions can be obtained 
by simulating the mathematical model of the system or using 
past experimental data. Afterwards the policy is updated by 
mixing the online measurements with past observations. The 
Exploration/Exploitation trade-off is addressed using an e- 
greedy policy. 

This paper is organised as follows. Mathematical preliminar- 
ies are described in Section [EI] In order to make the paper self 
contained, the fitted Q algorithm is sketched and different as- 
pects of modelling in gene regulatory networks are discussed. 
The problem of controlling the toggle switch is formulated and 



discussed in detail in Section III The algorithm is described 
in Section |IV| finally, the simulation results are presented in 
Section [V] Reference trajectory tracking for the generalised 
repressilator system is the subject of our companion paper 

ma. 

II. Preliminaries 

A. Modelling in Biology 

At the cellular level chemical reactions depend on thermo- 
dynamical principles, since molecules must collide before a 
reaction can start. Therefore chemical reactions inside living 
organisms are modelled using stochastic calculus. The fol- 
lowing approach to chemical reaction modelling is described 
in detail in lfP9l . Consider a well-stirred system of k species 
in a constant volume and a thermal equilibrium. Assume the 
species are interacting through m reactions. Let N l (t) be the 
number of molecules of species i and i/»j(t) be the change in 
the molecular concentration of species i at time t if the reaction 
j occurs. The bold symbols will be used to denote vectors, 
e.g., N stands for the vector with elements N' 1 . Finally, let 
a,j(n)dt be the probability of reaction j occurring in the next 
infinitesimal interval [t, t + dt], if the number of molecules at 



time t, N(t), is equal to n. The functions dj(-) are called 
propensity functions. The time evolution of the concentration 
of species can be modelled by a Markov stochastic process, 
for which: 

dPr(n,t\n a ,t ) ^ 

Qj. = a i( n ~ v i> Pr ( n ~ v iA n ^ *o)- 

3=1 

Oj(n)Pr(n,t|n .*o) (1) 

where the probability Pr(n,t\n ,t ) stands for Pr(iV(i) = 
n\N(to) = no). This equation is called the chemical master 
equation. Instead of computing particular realisations, one can 
also obtain the expression for the mean. This can be done 
by multiplying ([TJ with n, summing over all n and using 
(h(N(t))) = X) n /i(n)Pr(n,i|no,*o), where (•) stands for 
the mean. 



d(N(t)) 
dt 



3=1 



(2) 



If by assumption N(t) is a deterministic function then (N(t)) 
is equal to N(t). Moreover, equation |2]i describes the species' 
concentrations in the confined volume where the chemical 
reactions take place. 

B. Formulation of the Optimal Control Problem 

Consider a deterministic discrete-time dynamical system 

n t+1 = f(n t ,u t ) (3) 

where Ut is the control input at time t, which belongs to a 
compact set U for every t. In the stochastic case, Markov 
decision processes (MDPs) are typically employed, for which 

Pr (nt+i € JVt+i {»*}fc=o>{ tt *}fc=o 



Pr (n t+1 G N t+1 n t ,u t 

The above relationship means that the probability of the state 
n t +i belonging to the set N t +i does not depend on the entire 
history of the realisation of the states {«/c}fc =0 an d control 
signals {uk}* k=0 , but depends only on the current values n t 
and u t . Under the above Markovian assumption, dynamical 
stochastic systems can be modelled as 

Pr(n t+ i £ N t+1 n u u^j = J f(n t ,u t ,x)dx, 
or in a compact form 



nt+i ~ f(n t ,u t , •)■ 

Here, we slightly abuse the notation by using again the symbol 
/ as in the deterministic system ([3]). This is done, in order 
to signify that these functions describe the dynamics of the 
system whether it is stochastic or deterministic. 

In both cases, consider an optimal control problem, which 
is defined through the minimisation of an infinite sum of 
discounted costs c(n, u). In the deterministic case the problem 
is defined as 



V(n t ) 



mm 

H(-): A»(n.;) = 
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and in the stochastic case as 
V{n t ) = 

K 

min lim E„ /(rit:Utr) ^ 7 l - t c(n t ,M t ) 

where V(n t ) is called the value function and /i(-) is a mapping 
from n to u, which is called the control policy. The cost 
function c specifies the objective of the control problem, 
which in our case is driving the system to a specific region 
in the state-space. In our setting, the control policy should 
be inferred based only on realisations of one-step system 
transitions {n/, ui,n^}, where rt z + is a successor state of the 
system in the state n; and subjected to the input it; (in the 
deterministic case, if the function /(•,•) is known is equal 
to f(ni,ui)). For the purpose of this paper, the function c(-, •) 
is assumed to be known in advance. 

C. Fitted Q Iteration 

A central object of the fitted Q algorithm is the Q function, 
which is introduced as follows: 

oo 

Q{n t ,u t ) = c(n t ,u t ) + min VV~*c(ni, jt(ni)) 

i—t 

Once a Q function is computed, the optimal feedback control 
policy is given as: 

H*(n) = argmin Q(n, it) 
ueu 

Under certain conditions (which form the celebrated optimality 
principle), the Q function can be obtained as the unique 
solution of the following iterative procedure: 

Q k (n,u) = c(n,u) + 7 min Q k _i(f(n,u),u') (4) 

u'eU 

where Qq is equal to c. However, Q is hard to solve in 
general, especially if only the triplets T are given. Therefore 
an approximation Q of the Q function is computed using an 
iterative procedure. Let Qo = c and for every (ni,Ui,nt) in 
T compute: 

Qi{n u ui) = c(n h u{) + 7 min Q (nf, u) 

ueU 

This expression gives Qi only for ni, it; in T, while the 
entire function Qi(-, ■) is estimated by a regression algorithm 
(e.g., EXTRA Trees ||20| ). This can be generalised to an 
iterative procedure, which can be used to obtain a near-optimal 
control policy as outlined in Algorithm [T] The stopping 
criterion can be simply the maximum number of iterations 
Nn, which is chosen such that the number 7^" is sufficiently 
small and the values Q).{ni,ui) are not modified significantly 
for k larger than N- it - Other criteria are described in |17|. Due 
to space limitation a more detailed method description is given 
in our companion paper ifTHll . Note that Algorithm [T] can be 
extended to handle the stochastic case as well ifTTl . 



Algorithm 1 Fitted Q iteration algorithm 

Inputs: Set of triplets T = {ni,ui,n^}f^l, stopping crite- 
rion, cost function c(-, ■) 
Outputs: Policy /t*(n) 
k <- 

Qo(v) <-c(.,0 
repeat 

k <- k + 1 

In order to obtain the values of Qk(-,-) for all {ni,ui} 
in T compute: 

Q k {n h ui) = c(ni,ui) + 7 min Q k -i(nt,u) (5) 

Estimate the function Qk(n,u) using a regression al- 
gorithm with input pairs (rii,Ui) and function values 
Q k (ni,ui). 
until the stopping criterion is satisfied 

Compute ft*(n) = argmin Qk (n, it) 
ueu 



III. System Description and Problem Setting 



The genetic toggle switch introduced in GTl consists of 
the Lad and TetR genes mutually repressing each other (see 
Figure [TJ. The mutual repression means that the increase in the 
protein product concentration of one gene implies the decrease 
in the protein product concentration of the other gene. It can 
be shown that this system typically has two stable fixed points. 
At each of these stable fixed points, one gene is downregulated 
(switched off), while the other gene is upregulated (switched 
on). This means that only one of two genes can be switched 
"on" at the steady state. In the sequel, we will use the 
following numeric references. Gene 1 and protein 1 will denote 
the LacI gene and the protein produced by its expression, 
respectively. Similarly, gene 2 and protein 2 will denote the 
TetR gene and its expression product, respectively. We assume 
that for both genes the protein concentrations are given as 
readouts, for instance via fluorescent markers. On the other 
hand, we assume that the control inputs are implemented as 
light pulses activating a photo-sensitive promoter controlling 
the expression of gene 1. When this photo-sensitive promoter 
is activated through a light pulse the concentration of the 
gene product 1 is increased by a small amount through the 
expression of gene 1. Basic mass-action kinetics of the toggle 
switch result in a high-order model, which is typically reduced 
to a two state model using time scale separation (cf. [22 1). 
This can be done because most of the reactions occur on a 
fast time scale (order of seconds) in comparison with the gene 
expression time scale (order of minutes or even hours). The 
fast time scale includes the mRNA dynamics and the light- 
induction of the promoter [8|. We approximate the action of 
light induction u t as a discrete variable in the set U = {0, 1}. 
The reduced order model has two states, which are the two 
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protein concentrations: 



Algorithm 2 Online learning algorithm 



H+l 



"t+1 



— j^-ttt din} + but 

+ (n 2 )^ 



1 + (ni)«i 



d 2 n 2 



(6) 



where n\ is the concentration of the protein i at time t, 
Ci is the effective rate of synthesis of the repressor i, on 
is the cooperativity coefficient of the repressor i, d{ is the 
degradation rate of protein i, and b is the number of proteins 
produced per unit of time as a result of one light pulse. A 
more realistic model would also have a time-delayed control 
action and a "leaky" transcription on both gene expressions. 
Such extensions require simple modifications of our control 
algorithm, but make the results less transparent and harder to 
analyse. 

The stochastic model, which is also used in the sequel, 
is derived based on the deterministic model d5), i.e., the 
propensities are obtained from the reduced order model 
This results in the following Markov decision process: 

dPr (n^n 2 ^) 



Of 



= [d 2 (n 2 - 1) + .g 2 )Pr(n\n 2 - l,t)+ 

(g! + bu t + dxin 1 - l))Pr(y - l,n 2 ,t)- ^ 
(din 1 + din 2 + gi + g 2 + bu t ) Pr(n , n 2 , t) 



where 



9i 



and g 2 



c 2 



1 + (n 2 ) Q 2 yz 1 + (nj) a i 

The control objective in both deterministic and stochastic 
settings is to start from any initial condition in the vicinity 
of the first fixed point (for which gene 1 is off) and drive the 
system to the vicinity of the second fixed point (for which gene 
1 is on) by applying appropriate control signals to gene 1. At 
the same time the optimal control objective needs to address 
the trade-off between minimum time control and minimum 
burden control. 

IV. Control Algorithm 

Our goal is to develop a control algorithm, which learns how 
to near-optimally control the toggle switch system in a single 
experiment. Toggling the switch can be done experimentally 
in a couple of hours and the fastest measurement sampling is 
in the order of one minute. This gives at most 200 samples 
in a single trajectory. Learning a near-optimal control policy 
for toggling the switch with such limited amount of data is 
an extremely hard problem to solve. To tackle this issue, 
we propose to first learn a "rough approximation" of the 
control policy obtained by applying Algorithm [TJ to one-step 
system transitions F artificially generated from simulations 
of a mathematical model. Afterwards the policy is fine-tuned 
by mixing the online measurements with past observations F ' . 
The Exploration/Exploitation trade-off is addressed using an 
e-greedy policy. 

Our approach is outlined in Algorithm]^ Let QAig [!](•>•) be 
the approximation of the Q function obtained by Algorithm [TJ 
While experiment is running new input-output samples are 
collected in F nC w through direct interaction with the real 



Inputs: Set F = {ni,ui,nf}f^, cost function c(-,-), 
the function QAig []](•, ■)' number of iterations N, function 

Qo(-, •) <- QAig[Tj(-, •) 

F cur ^ F 

while new data is received do 

k <- 

Collect a set of new samples F 11CW = 
{n m , it TO , re TO } m _™ 

F cur ^ h(J~ cur new) 

while k < N do 

k <- k + 1 

In order to obtain the values of Qk(-, •) for all {ni, ui} 
in Jc,,,. compute: 

Q k (n,u) = c(n,u) +7 min Q fe _i(n + , it') (8) 

u'£U 

Estimate the function Qk(n,u) using a regression 
algorithm with input pairs (ni , ui ) and function values 

Q k (m,ui). 
end while 

Qo(v)4-Q*(v) 

end while 



system. Then the approximation of the Q function is updated 
as prescribed in the inner loop of Algorithm [2] After that the 
new set .F n ew is formed and new samples are collected. 

The major challenge of Algorithm [2] is appropriately choos- 
ing the function h(-,-), which combines the sets F cxsx and 
.Fnew As an example, we consider /i(F CU r, F ncw ) = F cur U 
Fnew Such a choice has some drawbacks. If the initial set F 
contains many samples, then the updates in (|8]l will not result 
in significant changes in the policy. This happens because 
the algorithm appreciates equally the samples in F and the 
new sets F ncw , even though the samples in F are artificially 
generated using a mathematical model and the samples in 
F ncw are obtained form the real system. 

The choice of other parameters is guided by the following 
considerations. The number of iterations N is chosen accord- 
ing to the computational constraints in the control system. 
In our simulations, we consider the worst case scenario and 
assume that N is equal to one. 

An important task of such a learning algorithm is a trade- 
off between exploration and exploitation in generation of the 
set Fnew The exploration is required, since the real system 
is essentially unknown to the algorithm and the exploratory 
actions will provide new information. The trade-off policy 
between exploration and exploitation is defined as follows: 



u t = 



axgmin^gj; Qk(n t ,u') with probability e t 



random action 



with probability 1 — et 



where n t is the state measured at time t and Qk{-,-) is a 
current approximation of the Q function. In our experiment et 
is an increasing function of t between zero and one. During the 
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first time samples, the need for new information is typically 
higher, and thus a low value of St should be chosen. 

The structure of the instantaneous cost c(n, u) is chosen as 
follows: 

c(n 1 , n 2 , u) = n + a^n 2 + a u u 

where ot-x, a u are non-negative constants. This choice of cost 
function is based on the fact that, in most biological systems, 
both n and u take only non-negative values. Linear cost 
functions are very useful in such case as was shown in [24] 
for linear time-invariant systems with non-negative states and 
inputs. The weight a 2 influences the switching time; the larger 
the value of a 2 the faster the switch. Since our goal is to toggle 
the genetic switch depicted in Figure [TJ from the situation, 
where n 1 is small and n 2 is large to the opposite situation, 
the constant a 2 must be larger than one. Our target region 
in the state-space is a neighbourhood around the steady-state, 
where n 1 is large and n 2 is small. Therefore, this cost will 
ensure that the target region is reached without specifying this 
region explicitly in the cost. Finally, the term a u u penalises the 
control signal and therefore attempts to minimise the burden 
associated with light-induced gene expression. The choice of 
the weights a 2 and a u dictates the trade-off that exists between 
toggling the switch fast and toggling the switch with a reduced 
gene expression burden. 

In the stochastic case, the direct Gillespie stochastic simu- 
lation algorithm |[T9l is used to compute the value n t+ i based 
on n t . Each trajectory of the algorithm starts at n t and is 
computed until the next sampling time t + 1. The value n t +\ 
is then averaged over one hundred trajectories. In terms of 
Algorithm [TJ the only difference with the deterministic case 
is in the choice of the parameter n m i n . This parameter prunes 
the regression trees and makes the Q function "smoother". 
The choice of this parameter is discussed in the next section. 

The algorithm is implemented in Python using the machine 
learning [25], parallelisation [26|, graphics |27| and scientific 
computation [28 1 toolboxes. 

V. Simulation Parameters and Results 

As an illustration of the benefits of our proposed approach, 
we investigate how our algorithm handles model uncertainty. 
For this, we investigate how a control policy learned in batch- 
mode from data collected from one system can be adapted 
in online mode using data collected from a second system 
to control this latter system. Consider the system |6]i with 
parameters d\ = 1, d,2 = 1, b = 20. The rest of the parameters 
are chosen according to two settings: 

a x = 1 a 2 = 3 ci = 30 c 2 = 10 (9) 
a x = 3 o<2 = 1 ci = 40 c 2 = 60 (10) 

These parameters are almost "opposed" in terms of the corre- 
sponding fixed point values of gene 1 and gene 2, making 
these two settings very different, even though the systems 
are structurally similar. In order to compute the initial Q 
function, we generate 5000 trajectories with 50 samples in 
each trajectory. The discount factor 7 is chosen equal to 
0.75. This parameter represents the trade-off between fast 
convergence of the algorithm and the importance attributed 
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Fig. 2. Control of a deterministic model of the genetic toggle switch. The 
phase plot of the control policy is overlapped with the trajectories; the red 
area marks the region of the state space for which the near-optimal control 
action is equal to one, while the white area marks the region of the state space 
for which the near-optimal control action is equal to zero. The near-optimal 
control policy is here learned from the system with parameter values given 
in Circles and crosses denote the application of the control action. The 
blue trajectory correspond to the system with parameter values given in j5J, 
while the green trajectory corresponds to the system with parameter values 
given in (10) . 



to future control actions. Note that the closer 7 is to 1, the 
more importance is attributed to future control actions. The 
weights in the cost function are chosen such that a 2 is equal 
to 60 and a u is equal to 1. 

First, let us consider the deterministic case. The control 
policy is learned from the system with parameter values given 
in |9 1 and applied to both systems with parameter values in |9} 
and (TO}. As can be seen in Figure |2]both switches are toggled 
successfully. Even when the system with parameter values 
given in ( fTO) is controlled the algorithm is able to fulfil the 
control objective. In order to comment further on the results, 
we can take a closer look at the computed control policy 
depicted in Figure [2] Essentially, a threshold for protein 1 is 
set such that, if its concentration falls below a certain value, 
then a control action is applied to drive up its concentration. 
Eventually, protein 1 becomes dominant and its concentration 
starts to increase due to the dynamics of the toggle switch 
system. At the same time, the concentration of protein 2 starts 
to decrease. The threshold value (i.e., the boundary between 
the red and white regions in Figure [2} can be adjusted through 
the value of the weights appearing in the definition of the 
instantaneous cost function. For example, if the weight on 
the control signal u is increased (respectively decreased), the 
threshold moves to the left (respectively to the right). 

The simulation results in Figure [2] might seem impressive. 
However, such behaviour of the controlled system is not 
always achievable. In our second set of simulations, the 
threshold value is not sufficient to drive the system in the 
region of attraction of the second equilibrium (see upper panel 
of Figure [3j. To address this problem, we use Algorithm [2] 
as described in the previous section. Figure [3] illustrates that 
Algorithm [2] can serve as an adaptation technique that com- 
pensates for model uncertainty and variations in the controlled 
system. Algorithm [2] is able to toggle the switch in the setting 
which proved to be difficult to handle for the purely batch- 
mode algorithm. 
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Fig. 3. Trajectories of the protein dynamics in the offline (batch-mode) and 
the online fitted Q iteration modes. In the upper panel we can see that the 
control policy inferred from input-output data of the system with parameter 
values given in \\()\ is not able to adequately toggle the genetic switch with 
parameter values given in j9j. The system does not converge to its the desired 
state and oscillations are produced instead. In the lower panel, we start with the 
same policy. However, the policy is now adjusted using Algorithm [2] applied 
to newly obtained data. The red circles in both panels represent the control 
action obtained from the available control policy. The blue crosses are the 
exploratory actions, which are applied according to the e-greedy strategy. 
This approach manages to provide the algorithm with a sufficient amount 
of information in order to solve the optimal control problem of efficiently 
toggling the switch. 

The stochastic case simulation setting is very similar to the 
deterministic one. For the stochastic case, the parameter n m i n 
is chosen equal to 5, which provides a control policy similar 
to the deterministic case depicted in Figure [2] In general, 
the parameter n m i n can be chosen automatically using cross- 
validation methods (cf. |29l). The simulation results in the 
stochastic setting are depicted in Figure |4] The simulation 
is performed using direct Gillespie stochastic simulation al- 
gorithm. At every time instance t, one hundred trajectories 
starting at n t are computed until the next time instance t + 1, 
and the value n t+ i is then averaged over these trajectories. The 
stochastic simulation setting is similar to the deterministic one 
depicted in Figure [3] Unlike the simulation in Figure [5] for 
this stochastic simulation the toggle is switched; however, this 
is just one realisation. On average the situation will be similar 
to the deterministic case. 

VI. Discussion and Conclusion 

A framework for data-based inference of feedback control 
laws is presented. We show that this framework can efficiently 



Stochastic control 




50 100 150 200 250 300 



Time (a.u.) 

Fig. 4. Control of a stochastic model of the genetic toggle switch. The curves 
represent the average trajectories obtained when the control policy is learned 
from the system with parameter values given in j!0| and then applied to the 
system with parameter values given in {9}. The simulation is performed using 
direct Gillespie stochastic simulation algorithm. At every time step t, one 
hundred trajectories starting at nt are computed until the next time instance 
t -f 1, and the value nt+i is then averaged over these trajectories. 

handle the control of gene regulatory networks. The objective 
of the optimal control problem is chosen as a trade-off between 
minimum time control and minimum burden control. 

The major feature of the presented control algorithm is its 
adaptive nature. The algorithm computes an initial control 
policy using a batch-mode reinforcement learning method. The 
input to the method are one-step transitions of the system, 
which can be obtained using the experimental data and/or sim- 
ulation of a mathematical model. Then this policy is adapted 
to the real system based on newly obtained measurements. 
The algorithm, however, does not take into account the a 
priori knowledge that it will be applied to a different, but 
structurally similar system. A correct exploitation of structural 
similarity between the learned from and applied to systems 
may significantly improve the performance of the presented 
algorithm. This constitutes one of the main directions for 
future work that is currently under investigation. 
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